2  Aligning Column Names

We imported the provided raw datasets and first ensured that columns referring to the same variable had the same name across all three datasets, using the provided codebook as a reference. Standardizing these names improved both the efficiency of subsequent analyses and the clarity of the data dictionary developed later in this report.
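One quick way to see which column names still disagree is to compare the name vectors of two tables directly. The sketch below is illustrative only: `ca_toy` and `la_toy` are hypothetical toy data frames with made-up values, not the project data, and the check uses base R rather than any helper from this report.

```r
# Toy stand-ins for the statewide and LA County tables (hypothetical values).
ca_toy <- data.frame(county = "Alameda", age_cat = "0-17", new_infections = 5,
                     stringsAsFactors = FALSE)
la_toy <- data.frame(age_category = "0-17", dx_new = 7,
                     stringsAsFactors = FALSE)

# Column names that appear in one table but not in the other.
only_in_ca <- setdiff(names(ca_toy), names(la_toy))
only_in_la <- setdiff(names(la_toy), names(ca_toy))

print(only_in_ca)
print(only_in_la)
```

Any name listed in either result is a candidate for renaming (or for adding as a new column) before the tables are combined.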

To facilitate merging, we added a county field to the Los Angeles County dataset so that it has the same set of columns as the California dataset. The two can then be joined or appended to create a single statewide morbidity dataset for later analyses.
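A minimal sketch of why the shared column set matters, again using hypothetical toy data frames (`ca_toy`, `la_toy`) rather than the project data. It uses base R `rbind()`, which requires the two data frames to have the same column names; `dplyr::bind_rows()` would be an alternative in a tidyverse workflow.

```r
# Toy tables after renaming: LA County still lacks a county column (hypothetical values).
ca_toy <- data.frame(county = "Alameda", age_cat = "0-17", new_infections = 5,
                     stringsAsFactors = FALSE)
la_toy <- data.frame(age_cat = "0-17", new_infections = 7,
                     stringsAsFactors = FALSE)

# Add the missing county field, then put the columns in the same order.
la_toy$county <- "Los Angeles"
la_toy <- la_toy[, names(ca_toy)]

# With identical column names, the tables can be appended row-wise.
stopifnot(identical(names(ca_toy), names(la_toy)))
statewide_toy <- rbind(ca_toy, la_toy)
print(statewide_toy)
```

Without the added `county` column, `rbind()` would stop with an error because the two data frames would not have matching columns.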

After completing these adjustments (renaming columns in the Los Angeles County dataset and adding the county field), the remaining tasks are to reconcile the age category and race/ethnicity variables and to standardize how the timing of infection is identified using MMWR weeks (see Figure 1 below).

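#- Packages used in this chunk:
library(here)     # here() builds paths relative to the project root
library(dplyr)    # %>%, select(), rename(), mutate(), relocate()
library(stringr)  # str_remove(), str_trim()

#- Note: clean() is not defined in this section; it is assumed to be a
#- helper defined or loaded elsewhere in the report.

#- Raw, provided datasets: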
raw_ca_df <- read.csv(file = here("_data/raw_data/raw_ca_df.csv"))
raw_la_cnty_df <- read.csv(file = here("_data/raw_data/raw_la_cnty_df.csv"))
raw_pop_df <- read.csv(file = here("_data/raw_data/raw_pop_df.csv"))

##-- California dataset:
ca_df <- raw_ca_df %>% select(-dt_diagnosis) %>%
  mutate(county = clean(county) %>% 
          str_remove("\\s*[Cc]ounty$") %>% 
          str_trim())

##-- LA county dataset:
la_cnty_df <- raw_la_cnty_df %>%
  rename("age_cat" = "age_category", 
        "race_ethnicity" = "race_eth",
        "new_infections" = "dx_new",
        "cumulative_infected" = "infected_cumulative",
        "new_unrecovered" = "unrecovered_new",
        "cumulative_unrecovered" = "unrecovered_cumulative",
        "new_severe" = "severe_new",
        "cumulative_severe" = "severe_cumulative") %>%
  mutate(county = "Los Angeles") %>%
  relocate(county, .before = everything()) 

##-- Population dataset:
pop_df <- raw_pop_df %>% 
  rename("race_ethnicity" = "race7") %>%
  mutate(county = clean(county))

2.0.1 Figure 1. Aligning Column Names